Collagen diseases are a group of conditions that affect the connective tissues in the body, which provide structure and support to organs and tissues. These diseases can range from mild to severe and include conditions like lupus, rheumatoid arthritis, and scleroderma. In these diseases, thrombosis is one of the most important and severe complications, and one of the major causes of death. Recently, this complication has been closely linked to anti-cardiolipin antibodies, which were discovered by the medical physicians, one of whom donated the datasets for the discovery challenge.
The seriousness of Thrombosis has made it one of the most heavily studied diseases in medicine, for properly understanding it and identifying its predictors could provide a breakthrough that might save thousands of lives.
That being said, Georgetown University has provided me with data from their university hospital to conduct an analysis on thrombosis and its possible contributors. This data includes information on patient demographics, medical history, and laboratory results, which will be used to identify potential risk factors and underlying causes of thrombosis.
The main objectives of this project are:
Identify potential predictors of Thrombosis.
Analyze any potential temporal patterns associated with Thrombosis.
Understand the characteristics of a patient with Thrombosis.
To achieve these objectives, the data will be carefully analyzed to identify any significant patterns and relationships that can provide insights into the causes and characteristics of Thrombosis. By doing so, this project aims to contribute to a better understanding of Collagen diseases, particularly Thrombosis, and ultimately improve the diagnosis and treatment of these life-threatening conditions.
2 Data Summary
The hospital has provided 3 main datasets; brief descriptions of each can be found below:
tsumoto_a.csv: Basic Information about Patients
tsumoto_b.csv: Special Laboratory Examinations
tsumoto_c.csv: Laboratory Examinations stored in Hospital Information Systems
Before doing any kind of analysis, it is important to properly understand the datasets in terms of shape, strengths, weaknesses, etc.
Importing all necessary libraries.
Code
# Importing all necessary packages
import statsmodels.api as sm
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import klib
import missingno as msno
from sklearn.preprocessing import StandardScaler
from math import pi
from wordcloud import WordCloud, STOPWORDS
import sys
Importing the 3 datasets. As a quick note, a couple of lines in tsumoto_C were causing issues when reading the dataset, so an "error_bad_lines=False" argument was introduced to bypass them. A total of 30 lines (out of 50k+) were dropped, so the impact of this drop is minimal.
Code
# Redirect stdout and stderr to null device
sys.stdout = open('/dev/null', 'w')
sys.stderr = open('/dev/null', 'w')

# Importing datasets
df_A = pd.read_csv("data/TSUMOTO_A.csv", encoding='ISO-8859-1')
df_B = pd.read_csv("data/TSUMOTO_B.csv", encoding='ISO-8859-1')
df_C = pd.read_csv("data/TSUMOTO_C.csv", encoding='ISO-8859-1', error_bad_lines=False)
# A couple of bad (unreadable) lines in tsumoto_c were dropped. The dropped lines
# account for 30 (out of 50k+), which makes this drop insignificant.
# (Newer pandas versions replace error_bad_lines=False with on_bad_lines='skip'.)
Quick look at each data set:
Code
df_A.head(5)
|   | ID    | SEX | Birthday  | Description | First Date | Admission | Diagnosis    |
|---|-------|-----|-----------|-------------|------------|-----------|--------------|
| 0 | 2110  | F   | 1934/2/13 | 94.02.14    | 93.02.10   | +         | RA susp.     |
| 1 | 11408 | F   | 1937/5/2  | 96.12.01    | 73.01.01   | +         | PSS          |
| 2 | 12052 | F   | 1956/4/14 | 91.08.13    | NaN        | +         | SLE          |
| 3 | 14872 | F   | 1953/9/21 | 97.08.13    | NaN        | +         | MCTD         |
| 4 | 27654 | F   | 1936/3/25 | NaN         | 92.02.03   | +         | RA, SLE susp |
Code
df_B.head(5)
|   | ID       | Examination Date | aCL IgG | aCL IgM | ANA | ANA Pattern | aCL IgA | Diagnosis         | KCT | RVVT | LAC | Symptoms | Thrombosis |
|---|----------|------------------|---------|---------|-----|-------------|---------|-------------------|-----|------|-----|----------|------------|
| 0 | 14872.0  | 1997/5/27        | 1.3     | 1.6     | 256 | P           | 0.0     | MCTD, AMI         | NaN | NaN  | -   | AMI      | 1          |
| 1 | 48473.0  | 1992/12/21       | 4.3     | 4.6     | 256 | P,S         | 3.3     | SLE               | -   | -    | -   | NaN      | 0          |
| 2 | 102490.0 | 1995/4/20        | 2.3     | 2.5     | 0   | NaN         | 3.5     | PSS               | NaN | NaN  | NaN | NaN      | 0          |
| 3 | 108788.0 | 1997/5/6         | 0.0     | 0.0     | 16  | S           | 0.0     | NaN               | NaN | NaN  | -   | NaN      | 0          |
| 4 | 122405.0 | 1998/4/2         | 0.0     | 4.0     | 4   | P           | 0.0     | SLE, SjS, vertigo | NaN | NaN  | NaN | NaN      | 0          |
Code
df_C.head(5)
|   | ID   | Date   | GOT | GPT | LDH | ALP | TP  | ALB | UA  | UN   | ... | C3  | C4  | RNP | SM  | SC170 | SSA | SSB | CENTROMEA | DNA | DNA-II |
|---|------|--------|-----|-----|-----|-----|-----|-----|-----|------|-----|-----|-----|-----|-----|-------|-----|-----|-----------|-----|--------|
| 0 | 2110 | 860419 | 24  | 12  | 152 | 63  | 7.5 | 4.5 | 3.4 | 16.0 | ... | NaN | NaN | NaN | NaN | NaN   | NaN | NaN | NaN       | NaN | NaN    |
| 1 | 2110 | 860430 | 25  | 12  | 162 | 76  | 7.9 | 4.6 | 4.7 | 18.0 | ... | NaN | NaN | NaN | NaN | NaN   | NaN | NaN | NaN       | NaN | NaN    |
| 2 | 2110 | 860502 | 22  | 8   | 144 | 68  | 7   | 4.2 | 5   | 18.0 | ... | NaN | NaN | NaN | NaN | NaN   | NaN | NaN | NaN       | NaN | NaN    |
| 3 | 2110 | 860506 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN  | ... | NaN | NaN | NaN | NaN | NaN   | NaN | NaN | NaN       | NaN | NaN    |
| 4 | 2110 | 860507 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN  | ... | NaN | NaN | NaN | NaN | NaN   | NaN | NaN | NaN       | NaN | NaN    |
5 rows × 44 columns
At first glance, the datasets seem to have quite a few missing values, which will be dealt with later on.
Code
"Shape of df_A: " + str(df_A.shape)
'Shape of df_A: (1240, 7)'
Code
"Shape of df_B: " + str(df_B.shape)
'Shape of df_B: (806, 13)'
Code
"Shape of df_C: " + str(df_C.shape)
'Shape of df_C: (57510, 44)'
The lengths of the datasets make sense given their nature: tsumoto_C is by far the largest one, since it contains general laboratory examinations, where each patient has more than one entry (multiple examinations per patient).
Before moving on with any analysis, renaming the columns to make the datasets tidier is a good idea. The idea is to have all column names in lower case and with "_" (underscore) instead of spaces.
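A minimal sketch of this renaming convention (the helper name `tidy_columns` and the demo frame are hypothetical, not from the original code):

```python
import pandas as pd

def tidy_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Lower-case column names and replace spaces with underscores."""
    df = df.copy()
    df.columns = [c.strip().lower().replace(' ', '_') for c in df.columns]
    return df

# Example with column names like those in tsumoto_a.csv / tsumoto_b.csv
demo = pd.DataFrame(columns=['ID', 'First Date', 'ANA Pattern'])
print(tidy_columns(demo).columns.tolist())  # ['id', 'first_date', 'ana_pattern']
```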
def format_date_column(df, col_name):
    formatted_dates = []
    for date in df[col_name]:
        y = str(date)[0:2]
        m = str(date)[2:4]
        d = str(date)[4:6]
        formatted_dates.append('19' + y + '/' + m + '/' + d)
    df[col_name] = formatted_dates
    return df

df_C = format_date_column(df_C, 'date')
A huge part of this project focuses on joining these datasets and subsets of them, so checking the missing values per column (per dataset) gives a good sense of which columns are of little use and should possibly be dropped.
Code
# Missing values plot df_A
# Redirect stdout and stderr to null device
sys.stdout = open('/dev/null', 'w')
sys.stderr = open('/dev/null', 'w')
klib.missingval_plot(df_A, cmap='Reds')
plt.figure(figsize=(8, 6))
plt.show()
All columns in df_A seem fine; description and first date (date of admission) are the columns with the highest rates of missing values (15% and 20%), but that shouldn't cause any trouble. Missing values in these columns can be dealt with at later stages, when/if they are used.
Code
# Missing values plot df_B
# Redirect stdout and stderr to null device
sys.stdout = open('/dev/null', 'w')
sys.stderr = open('/dev/null', 'w')
klib.missingval_plot(df_B, cmap='Reds')
plt.figure(figsize=(8, 6))
plt.show()
It appears that df_B is more concerning than df_A, as it contains 4 columns with more than 70% (some even 80%) of the values missing. These columns, namely kct, rvvt, lac, and symptoms, pose a significant challenge in terms of replacing missing values. Since only around 20% of the values in these columns are present, it's difficult to determine how the missing values are distributed and how they might impact the data. Attempting to replace them would introduce bias and skewness, leading to results that may not accurately represent reality. Therefore, the best course of action is to drop these columns and exclude them from further analysis.
Code
# Missing values plot df_C
# Redirect stdout and stderr to null device
sys.stdout = open('/dev/null', 'w')
sys.stderr = open('/dev/null', 'w')
klib.missingval_plot(df_C, cmap='Reds')
plt.figure(figsize=(8, 6))
plt.show()
As for df_C, it turns out to be the most troubling dataset: almost half (19 out of 44) of the columns have more than 60% missing values, of which 15 columns have more than 80% missing. The main issue is that the majority of these columns are numeric and could have added context to the analysis, but their current state makes them unusable and very troublesome.
To further understand the reason behind that much missing data, I will plot missing-value correlation heatmaps, which show which variables are usually missing together and the possible implications.
Code
# Missing value correlations df_A
# Redirect stdout and stderr to null device
sys.stdout = open('/dev/null', 'w')
sys.stderr = open('/dev/null', 'w')
mva = msno.heatmap(df_A, cmap='Reds', figsize=(8, 6), fontsize=10)
Nothing too significant for df_A.
Code
# Missing value correlations df_B
# Redirect stdout and stderr to null device
sys.stdout = open('/dev/null', 'w')
sys.stderr = open('/dev/null', 'w')
mva = msno.heatmap(df_B, cmap='Reds', figsize=(7, 5), fontsize=10)
The only significant missing-value correlations are between kct and lac as well as rvvt and lac, which makes sense considering they measure the same thing (degree of coagulation). Two of these three columns should be dropped considering they represent the same thing.
Code
# Missing value correlations df_C
# Redirect stdout and stderr to null device
sys.stdout = open('/dev/null', 'w')
sys.stderr = open('/dev/null', 'w')
mva = msno.heatmap(df_C, cmap='Reds', figsize=(7.4, 5.4), fontsize=10)
The upper half of the above missing-value correlation plot is full of correlations in the 0.7-1 range, which could be explained in 2 ways:
There are a lot of missing values in many columns, leading to high missing-value correlations.
Most of these lab results come from the same panel of tests, so skipping one test would impact many results. For example, tp and got (correlation 0.9) could come from the same test.
The above hypotheses would be further understood once missing values are dropped and new heatmaps are plotted, which is part of the next steps.
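For intuition, the quantity behind these heatmaps is just the correlation between per-column missingness indicators; a toy sketch (column names borrowed from the datasets, values invented):

```python
import numpy as np
import pandas as pd

# Toy frame where 'tp' and 'got' are always missing together,
# while 'ana' is missing independently of both.
df = pd.DataFrame({
    'got': [1.0, np.nan, 3.0, np.nan],
    'tp':  [7.0, np.nan, 6.8, np.nan],
    'ana': [np.nan, 16.0, 256.0, 4.0],
})

# Correlation of the missingness indicators (what msno.heatmap visualizes)
nullity_corr = df.isna().corr()
print(nullity_corr.loc['got', 'tp'])  # got and tp are always missing together
```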
Finally, since one of the main goals of the research project is to understand Thrombosis at a deeper level, it is important to see how many cases are present in the dataset and how they evolved over time.
Code
# Temporal analysis (how has thrombosis evolved over time)
temporal_B = df_B.dropna(subset=['examination_date', 'thrombosis'])
temporal_B = temporal_B.sort_values(by=['examination_date'])
temporal_B['examination_date'] = pd.to_datetime(temporal_B['examination_date'])

# Replace thrombosis 2 and 3 with 1, since we want to analyze thrombosis in general
temporal_B['thrombosis'] = temporal_B['thrombosis'].replace({2: 1, 3: 1})

# Extract the year from the datetime and create a new column
temporal_B['year'] = temporal_B['examination_date'].dt.year
thrombosis_counts = temporal_B.groupby('year')['thrombosis'].value_counts().unstack()

ax = thrombosis_counts.plot(kind='bar', stacked=True, figsize=(8, 5), color=['#979797', '#f44336'])
ax.set_title("Thrombosis case growth over the years")
ax.set_xlabel('Year')
ax.set_ylabel('Count')
The bar chart above showcases a big issue with the data: there are very few Thrombosis patients (and even fewer level 2 and level 3 patients), which could lead to inaccurate conclusions about the disease. For example, if I were to analyze the average IgA of Thrombosis level 3 patients but there are only 6 patients in that category, the results might be very far from the population (all Thrombosis level 3) average. This is just a quick example of how this could cause issues, so it is important to highlight this limitation before moving forward with any analysis.
3 Initial Questions
Does age have any impact on Thrombosis?
Is Thrombosis more prevalent in males compared to females (or vice versa)?
What laboratory tests could help predict Thrombosis?
Are there any prevalent symptoms a person experiences when they have Thrombosis?
4 Data Munging
As previously discussed, missing values are a big obstacle in these datasets, so dealing with them is a priority before carrying out any analysis. The first rule/assumption I decided on is to drop any column with more than 70% missing values, as the potential bias and skewness introduced by data imputation is not worth it. Columns close to the 50-60% missing range will be kept and evaluated on a case-by-case basis.
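The 70% rule amounts to a simple threshold on the per-column missing fraction; a minimal sketch under that assumption (the helper `drop_sparse_columns` and the toy values are hypothetical):

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df: pd.DataFrame, max_missing: float = 0.70) -> pd.DataFrame:
    """Drop columns whose fraction of missing values exceeds max_missing."""
    keep = df.columns[df.isna().mean() <= max_missing]
    return df[keep]

# Toy example: 'lac' is 80% missing and gets dropped, 'ana' (20% missing) is kept.
demo = pd.DataFrame({
    'ana': [256, 16, 4, np.nan, 0],
    'lac': [np.nan, np.nan, '-', np.nan, np.nan],
})
print(drop_sparse_columns(demo).columns.tolist())  # ['ana']
```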
Check new column dimensions (after dropping columns)
Code
"Shape of modified df_A: " + str(df_A.shape)
'Shape of modified df_A: (1240, 7)'
Code
"Shape of modified df_B: " + str(df_B.shape)
'Shape of modified df_B: (806, 9)'
Code
"Shape of modified df_C: " + str(df_C.shape)
'Shape of modified df_C: (57510, 24)'
In terms of joining datasets, there are multiple ways of doing so, depending on the question being answered. The first set of questions and analysis I want to do is on Thrombosis itself: understanding how prevalent it is, what its main symptoms are, and what some predictors might be.
For that reason, the primary unit of observation will be the patient id and whether they have or do not have Thrombosis. In other words, rows in df_B (based on id) are the unit of observation and they will be joined with the corresponding dataframes.
analytic_1 is the first dataset I will use to do a patient-based analysis on Thrombosis, so an inner join will be used between df_B and df_A, and then the same will be done with df_C. The join will be done on the "id" column between A and B.
Code
# Drop NAs in the ID column of df_B since they are the main unit of analysis
# (a total of 30 rows dropped, not that bad)
patients = df_B.dropna(subset=['id'])
patients['id'] = patients['id'].astype('int')

analytic_1 = pd.merge(df_A, patients, on="id")
analytic_1.head(5)
|   | id     | sex | birthday  | description | first_date | admission | diagnosis_x | examination_date | acl_igg | acl_igm | ana | ana_pattern | acl_iga | diagnosis_y       | thrombosis |
|---|--------|-----|-----------|-------------|------------|-----------|-------------|------------------|---------|---------|-----|-------------|---------|-------------------|------------|
| 0 | 14872  | F   | 1953/9/21 | 97.08.13    | NaN        | +         | MCTD        | 1997/5/27        | 1.3     | 1.6     | 256 | P           | 0.0     | MCTD, AMI         | 1          |
| 1 | 48473  | F   | 1948/10/7 | 97.08.13    | NaN        | +         | SLE         | 1992/12/21       | 4.3     | 4.6     | 256 | P,S         | 3.3     | SLE               | 0          |
| 2 | 102490 | F   | 1982/4/1  | 94.02.22    | NaN        | +         | PSS         | 1995/4/20        | 2.3     | 2.5     | 0   | NaN         | 3.5     | PSS               | 0          |
| 3 | 108788 | F   | 1942/3/15 | NaN         | NaN        | +         | SJS         | 1997/5/6         | 0.0     | 0.0     | 16  | S           | 0.0     | NaN               | 0          |
| 4 | 122405 | F   | 1961/5/22 | NaN         | NaN        | -         | SJS         | 1998/4/2         | 0.0     | 4.0     | 4   | P           | 0.0     | SLE, SjS, vertigo | 0          |
Column description seems to have a weird format and first_date has a missing-values issue, but both can be dealt with later on if they are used in any analysis.
As for df_C (laboratory examinations), it will be joined with analytic_1 into a separate dataset; since there are a lot of columns across the 3 datasets, joining them all into one would be troublesome and would overcomplicate things.
Code
# Join analytic_1 with df_C in a new dataframe called analytic_2
analytic_2 = pd.merge(df_C, analytic_1, on="id", how="right")
analytic_2.head()
|   | id    | date       | got | gpt | ldh | alp | tp  | alb | ua  | un   | ... | admission | diagnosis_x | examination_date | acl_igg | acl_igm | ana | ana_pattern | acl_iga | diagnosis_y | thrombosis |
|---|-------|------------|-----|-----|-----|-----|-----|-----|-----|------|-----|-----------|-------------|------------------|---------|---------|-----|-------------|---------|-------------|------------|
| 0 | 14872 | 1981/02/17 | 22  | 30  | 179 | 41  | 6.6 | 4.1 | 3.5 | 13.0 | ... | +         | MCTD        | 1997/5/27        | 1.3     | 1.6     | 256 | P           | 0.0     | MCTD, AMI   | 1          |
| 1 | 14872 | 1981/03/16 | 22  | 19  | 155 | 37  | 7.2 | 4.5 | 3.2 | 12.0 | ... | +         | MCTD        | 1997/5/27        | 1.3     | 1.6     | 256 | P           | 0.0     | MCTD, AMI   | 1          |
| 2 | 14872 | 1981/04/06 | 21  | 20  | 143 | 42  | 7.5 | 4.7 | 4.1 | 12.0 | ... | +         | MCTD        | 1997/5/27        | 1.3     | 1.6     | 256 | P           | 0.0     | MCTD, AMI   | 1          |
| 3 | 14872 | 1981/04/27 | 32  | 20  | 218 | 43  | 7.2 | 4.5 | 3.5 | 9.0  | ... | +         | MCTD        | 1997/5/27        | 1.3     | 1.6     | 256 | P           | 0.0     | MCTD, AMI   | 1          |
| 4 | 14872 | 1981/05/01 | 33  | 19  | 277 | 38  | 6.9 | 4.4 | 3.3 | 11.0 | ... | +         | MCTD        | 1997/5/27        | 1.3     | 1.6     | 256 | P           | 0.0     | MCTD, AMI   | 1          |
5 rows × 38 columns
The main reason a right join was used is to keep all the lab results of patients who have been tested for Thrombosis and have a result (0, 1, 2, or 3). Analyzing lab results of patients that do not have any logged information in analytic_1 is useless and would not further our understanding of thrombosis.
analytic_2 showcases another major issue with the data provided: there is a very big time gap between the last lab test and the examination date for thrombosis, making it very tricky to associate any lab result with whether or not a person would get thrombosis. That being said, assuming that this difference doesn't matter simplifies things for the time being.
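The size of that gap can be quantified directly; a hedged sketch using invented rows shaped like analytic_2 (a lab `date` and an `examination_date` per patient id):

```python
import pandas as pd

# Toy rows mimicking analytic_2: multiple lab tests vs one thrombosis examination
demo = pd.DataFrame({
    'id': [14872, 14872, 48473],
    'date': ['1981/05/01', '1985/03/10', '1990/01/15'],
    'examination_date': ['1997/5/27', '1997/5/27', '1992/12/21'],
})
demo['date'] = pd.to_datetime(demo['date'])
demo['examination_date'] = pd.to_datetime(demo['examination_date'])

# For each patient, years between the most recent lab test and the examination
last_lab = demo.groupby('id')['date'].max()
exam = demo.groupby('id')['examination_date'].first()
gap_years = (exam - last_lab).dt.days / 365.25
print(gap_years.round(1).to_dict())
```

Gaps of this magnitude (years, not weeks) are exactly what makes linking lab values to the eventual thrombosis result so shaky.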
5 Exploratory Analysis
As a first step in conducting a comprehensive analysis of thrombosis, it is important to gain a deeper understanding of its associated side effects and its prevalence among different age groups and genders within the population.
Below is a density plot that shows the age distribution of people that have thrombosis vs people who do not.
Note: Thrombosis levels 1, 2, and 3 are all combined into one group in the plot below.
Code
# Create an age_at_examination column
# Fill NA values with 0; these will be dropped later on
analytic_1['age_at_examination'] = (
    np.round((pd.to_datetime(analytic_1['examination_date'])
              - pd.to_datetime(analytic_1['birthday'])).dt.days / 365, 0)
).fillna(0).astype(int)

thrombosis_patients = analytic_1[analytic_1['thrombosis'] != 0]
no_thrombosis_patients = analytic_1[analytic_1['thrombosis'] == 0]

age_t = thrombosis_patients['age_at_examination']
age_nt = no_thrombosis_patients['age_at_examination']

# Filter outliers
age_t = age_t[age_t > 2]
age_nt = age_nt[age_nt > 2]

# Age distribution plot
sns.kdeplot(age_t, color="#f44336", fill=True, label="Thrombosis + patients")
sns.kdeplot(age_nt, color="#979797", fill=True, label="Thrombosis - patients")

# Add legend
plt.legend()
plt.title("Age Distribution of Thrombosis + vs Thrombosis - patients")
plt.xlabel("Age")

# Display plot
plt.show()
The density plot above shows that Thrombosis does not seem to disproportionately affect older or younger people. In fact, most Thrombosis patients' ages are in the 20-60 range, which is pretty much the same as that of patients that did not have Thrombosis. The only slight difference is that patients with no thrombosis are slightly skewed to the younger side of that range, but ultimately I don't think this holds any relevance.
The next step is to see if there are any differences within the Thrombosis group, meaning between the different levels of the disease's severity.
Different levels of Thrombosis turn out to have different age distributions, with the age density of level 3 patients being the most concentrated in the 35-60 range, while those of level 1 and level 2 patients are more spread out.
Next is an analysis on gender
Code
# Calculate male and female counts for each dataset
thrombosis_counts = thrombosis_patients['sex'].value_counts()
analytic_counts = analytic_1['sex'].value_counts()

# Calculate male and female proportions for each dataset
thrombosis_proportions = thrombosis_counts / thrombosis_counts.sum()
analytic_proportions = analytic_counts / analytic_counts.sum()

# Define the data
gender_data = {'Female': thrombosis_proportions.values[0],
               'Male': thrombosis_proportions.values[1]}

# Convert data to a pandas dataframe
gender_df = pd.DataFrame(gender_data, index=[0])

# Create a stacked bar chart
gender_df.plot(kind='bar', stacked=True, figsize=(8, 6), color=['#f44336', '#979797'])

# Add labels and title to the chart
plt.xticks(gender_df.index, ['Gender'])
plt.ylabel('Percentage of Cases')
plt.title('Thrombosis Cases by Gender')

# Show the chart
plt.show()
At first glance, it appears that Thrombosis is predominantly a female disease: the bar chart shows that around 90% of the patients in the database that had Thrombosis were female. However, this information is extremely misleading without looking at the gender distribution in the data as a whole (not only Thrombosis + patients).
Code
import pandas as pd
import matplotlib.pyplot as plt

# Calculate male and female counts for each dataset
thrombosis_counts = thrombosis_patients['sex'].value_counts()
analytic_counts = analytic_1['sex'].value_counts()

# Calculate male and female proportions for each dataset
thrombosis_proportions = thrombosis_counts / thrombosis_counts.sum()
analytic_proportions = analytic_counts / analytic_counts.sum()

# Create a stacked bar chart
fig, ax = plt.subplots(figsize=(8, 8))
ax.bar(['Thrombosis', 'Dataset'],
       [thrombosis_proportions['M'], analytic_proportions['M']],
       label='Male', color='#979797')
ax.bar(['Thrombosis', 'Dataset'],
       [thrombosis_proportions['F'], analytic_proportions['F']],
       bottom=[thrombosis_proportions['M'], analytic_proportions['M']],
       label='Female', color='#f44336')

max_val = max([thrombosis_proportions.sum(), analytic_proportions.sum()])
ax.set_ylim([0, max_val])
ax.set_ylabel('Proportion')
ax.legend()
plt.show()
As it turns out, the dataset itself is made up almost exclusively of female patients, which is why Thrombosis appears to be a female-dominant disease. This highlights a huge flaw and imbalance in the data, as there is no objective way to analyze whether gender is significant in the Thrombosis conversation.
Next, understanding the characteristics/side effects of Thrombosis is a stepping stone towards finding potential predictors of the disease. I will start by analyzing the lab test results (those of df_B).
Look at box plots to check for outliers.
Code
# Subset to columns of interest
lab_results = analytic_1[['acl_iga', 'acl_igg', 'acl_igm', 'ana', 'thrombosis']]
lab_results = lab_results.dropna(subset=['ana'])  # Only 4 missing values
lab_results['ana'] = lab_results['ana'].astype('int')

# Find and drop the rows containing '>4096'
cols_to_normalize = ['acl_iga', 'acl_igg', 'acl_igm', 'ana']
mask = (lab_results[cols_to_normalize] == '>4096').any(axis=1)
lab_results = lab_results.drop(index=lab_results[mask].index)

# Create subplots with box plots
fig, axs = plt.subplots(2, 2, figsize=(8, 8))
boxp = lab_results[['acl_iga', 'acl_igg', 'acl_igm', 'ana']].astype(float)
axs[0, 0].boxplot(boxp['acl_iga'])
axs[0, 0].set_title('acl_iga')
axs[0, 1].boxplot(boxp['acl_igg'])
axs[0, 1].set_title('acl_igg')
axs[1, 0].boxplot(boxp['acl_igm'])
axs[1, 0].set_title('acl_igm')
axs[1, 1].boxplot(boxp['ana'])
axs[1, 1].set_title('ana')

# Add x- and y-axis labels
for ax in axs.flat:
    ax.set_xlabel('Feature')
    ax.set_ylabel('Value')

# Add overall title, adjust layout, show plot
plt.suptitle('Box plots of features', fontsize=16)
plt.tight_layout()
plt.show()
The data has a couple of outlier issues. Setting the threshold at 100 for IgA, IgG, and IgM makes sense, as does putting it at 200 for ANA.
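Applying those thresholds amounts to a simple row filter; a sketch under that assumption (the helper `clip_outliers` and the toy values are hypothetical):

```python
import pandas as pd

def clip_outliers(df: pd.DataFrame, thresholds: dict) -> pd.DataFrame:
    """Keep only rows where each listed column is at or below its threshold."""
    mask = pd.Series(True, index=df.index)
    for col, limit in thresholds.items():
        mask &= df[col] <= limit
    return df[mask]

thresholds = {'acl_iga': 100, 'acl_igg': 100, 'acl_igm': 100, 'ana': 200}
demo = pd.DataFrame({
    'acl_iga': [0.0, 3.5, 250.0],
    'acl_igg': [1.3, 4.3, 12.0],
    'acl_igm': [1.6, 4.6, 9.0],
    'ana': [256, 16, 4],
})
print(clip_outliers(demo, thresholds).shape[0])  # 1 row survives
```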
The data is now in a better state to analyze (a handful of outliers remain). Below is a radar chart that shows how patients that do not have Thrombosis compare, in terms of (averaged) lab results, to those who do, at the different levels.
Code
# Normalize columns to have them all on the same scale
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
cols_to_normalize = [col for col in lab_results.columns if col != 'thrombosis']
lab_results_norm = pd.DataFrame(scaler.fit_transform(lab_results.loc[:, cols_to_normalize]),
                                columns=cols_to_normalize, index=lab_results.index)
lab_results_norm['thrombosis'] = lab_results['thrombosis']

# Get the means of each thrombosis group
means = lab_results_norm.groupby('thrombosis').mean().reset_index()

categories = ['acl_iga', 'acl_igg', 'acl_igm', 'ana']
variable1 = means.iloc[0, :].values
variable2 = means.iloc[1, :].values
variable3 = means.iloc[2, :].values
variable4 = means.iloc[3, :].values
label_loc = np.linspace(start=0, stop=2 * np.pi, num=len(variable1))

# Set colors and fill
colors = ['#4DBBD5', '#00A087', '#E64B35', '#3C5488']
fill_colors = ['#8ED4EB', '#4DD0C1', '#FF987D', '#7081BB']

plt.figure(figsize=(8.5, 8.5))
plt.subplot(polar=True)
plt.fill_between(label_loc, variable1, 0, alpha=0.2, color=fill_colors[0])
plt.plot(label_loc, variable1, color=colors[0], label='No Thrombosis', linewidth=2)
plt.fill_between(label_loc, variable2, 0, alpha=0.2, color=fill_colors[1])
plt.plot(label_loc, variable2, color=colors[1], label='Thrombosis 1', linewidth=2)
plt.fill_between(label_loc, variable3, 0, alpha=0.2, color=fill_colors[2])
plt.plot(label_loc, variable3, color=colors[2], label='Thrombosis 2', linewidth=2)
plt.fill_between(label_loc, variable4, 0, alpha=0.2, color=fill_colors[3])
plt.plot(label_loc, variable4, color=colors[3], label='Thrombosis 3', linewidth=2)

plt.title('Lab results averages of patients', size=24)
lines, labels = plt.thetagrids(np.degrees(label_loc), labels=[''] + categories, color='gray')
plt.legend(loc='lower right', bbox_to_anchor=(0.5, -0.3), ncol=2)
plt.tight_layout()
plt.show()
The radar chart shows that the only lab results that show significant differences between different Thrombosis groups are:
ANA: The higher the ANA value the more severe the Thrombosis diagnosis.
ACL_IGG: Thrombosis level 1 seems to have, on average, the highest acl_igg values among all groups. The fact that people with no Thrombosis and Thrombosis levels 2 and 3 have fairly close acl_igg values is odd, but it nevertheless shows some potential predictive power for a level 1 diagnosis.
It will be important to do the same kind of analysis with the other lab results in df_C later on.
A possible predictor of Thrombosis could be the symptoms a patient is experiencing.
Code
from wordcloud import WordCloud, STOPWORDS
import random
import matplotlib.pyplot as plt

# Read original dataset again
B = pd.read_csv("data/TSUMOTO_B.csv", encoding='ISO-8859-1')

# Define the thrombosis categories to iterate over
thrombosis_categories = [1, 2, 3]

# Create a figure with three columns
fig, axes = plt.subplots(1, 3, figsize=(9, 6))

# Iterate over the thrombosis categories and plot a word cloud in each column
for i, thrombosis_category in enumerate(thrombosis_categories):
    thrombosis_p = B[B['Thrombosis'] == thrombosis_category]

    # Concatenate all the strings in the "Symptoms" column into a single string
    # (dropna avoids joining NaN floats)
    text = ' '.join(thrombosis_p['Symptoms'].dropna())

    # Create a set of stopwords, excluding "Thrombosis" (obvious word)
    # plus some garbled words that don't make sense
    stopwords = set(STOPWORDS)
    stopwords.update(['Thrombosis', 'ë¦ÄêÉÈû¼îîÉÅÌ', 'öxîîÉ', 'ë¦ÄêÉÈû¼îîÉ ÅÌ ',
                      'ÅKè´ù¼ÄY', 'Budd-Chiari', 'ë¦ÄêÉÈû¼îîÉÅÌ', 'ÅKè´ù¼ÄY'])

    # Define the color palette
    custom_colors = ['#808080', '#F44336']

    def color_func(word, font_size, position, orientation, random_state=None, **kwargs):
        return random.choice(custom_colors)

    # Create a WordCloud object with the custom color palette
    wordcloud = WordCloud(width=400, height=400, background_color='white',
                          color_func=color_func, stopwords=stopwords).generate(text)

    # Plot the word cloud in the corresponding subplot
    ax = axes[i]
    ax.imshow(wordcloud)
    ax.axis("off")
    ax.set_title(f"Thrombosis {thrombosis_category}")

# Adjust the layout of the figure and display it
plt.tight_layout()
plt.show()
The main side effects of each thrombosis level are the following:
Level 1: Brain infarction
Level 2: CNS lupus
Level 3: Thrombocytopenia
The word clouds also highlight a major issue in the data previously mentioned, the lack of Thrombosis patients, which is why there are fewer and fewer words as the level of Thrombosis becomes more severe.
The next step is understanding the relationship Thrombosis has with the general lab tests’ results and if there are any possible predictors over there.
The most rational thing to do is to analyze whether there are any correlations between the lab results and Thrombosis. Since we are dealing with correlations between numerical variables and a categorical variable, the point-biserial correlation will be used.
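For reference, scipy exposes this directly as stats.pointbiserialr, and for a 0/1 variable it coincides with Pearson's r, which is what the code further below relies on. A toy sketch (values invented):

```python
import numpy as np
from scipy import stats

thrombosis = np.array([0, 0, 0, 0, 1, 1, 1, 1])
# Hypothetical lab values: higher on average in the positive group
lab_value = np.array([1.0, 2.0, 1.5, 2.5, 5.0, 6.0, 5.5, 4.5])

r_pb, p = stats.pointbiserialr(thrombosis, lab_value)
r_pearson, _ = stats.pearsonr(thrombosis, lab_value)
print(round(r_pb, 6) == round(r_pearson, 6))  # the two agree for a 0/1 variable
```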
Note: Missing values will be replaced with the column's mean. With more time, other data imputation methods could have been explored.
Below is a correlation heatmap that visualizes the correlation of all lab results in the third database with “Thrombosis”. The rationale behind such a plot would be identifying which lab results are highly correlated with Thrombosis, indicating potential predictive power.
Assumption: All thrombosis levels were combined into one category (2 -> 1, 3 -> 1); since there are very few cases overall, building a predictive model for a single Thrombosis category makes more sense.
Code
from scipy import stats

# Drop duplicate id and keep last entry, which is the most recent lab test
analytic_2 = analytic_2.drop_duplicates(subset=['id'], keep='last')

# Keep numeric columns to calculate the point-biserial correlation
numerical_cols = ['got', 'gpt', 'ldh', 'alp', 'tp', 'alb', 'ua', 'un',
                  'cre', 't-bil', 't-cho', 'tg', 'cpk', 'wbc', 'rbc', 'hgb', 'hct', 'plt',
                  'u-pro', 'c3', 'c4', 'acl_igg', 'acl_igm', 'ana', 'acl_iga', 'thrombosis']

# Fill NA with -1 to convert all columns to numeric (restored to NaN below)
for i in numerical_cols:
    analytic_2[i].fillna(value=-1, inplace=True)

analytic_2_numeric = analytic_2[numerical_cols].apply(pd.to_numeric, errors='coerce')
analytic_2_numeric = analytic_2_numeric.astype(float)
analytic_2_numeric = analytic_2_numeric.replace(-1, np.nan)  # Replace -1 with NaN

col_avg = analytic_2_numeric.mean().astype(float)
analytic_2_numeric = analytic_2_numeric.fillna(col_avg)  # Fill NaN values with column averages

# Combine all thrombosis cases into one category
analytic_2_numeric['thrombosis'] = analytic_2_numeric['thrombosis'].replace({2: 1, 3: 1})
analytic_2_numeric['thrombosis'] = analytic_2_numeric['thrombosis'].astype('int')

columns = []
correlations = []

# Compute the point-biserial correlation coefficient for each numerical column
for column in analytic_2_numeric.select_dtypes(include='number'):
    # Skip columns that have NaN values
    if analytic_2_numeric[column].isnull().values.any():
        print(f'Skipping column {column} because it has NaN values')
        continue

    # Compute the correlation coefficient and p-value
    x = analytic_2_numeric[column].values
    y = analytic_2_numeric['thrombosis'].values
    mask = np.isfinite(x) & np.isfinite(y)
    x = x[mask]
    y = y[mask]
    corr, p_value = stats.pearsonr(x, y)
    columns.append(column)
    correlations.append(corr)

correlationsdf = pd.DataFrame({'Variable': columns,
                               'Correlation with Thrombosis': correlations})

# Visualize correlationsdf
# Create an array of values
values = np.array(correlationsdf['Correlation with Thrombosis'].values)

# Create a colormap
cmap = plt.cm.get_cmap('Reds')

# Create a figure and axis with larger size
fig, ax = plt.subplots(figsize=(10, 10))

# Plot the heatmap (transposed)
heatmap = ax.imshow(values.reshape(1, -1).T, cmap=cmap)

# Remove the x-axis and set the y-axis ticks to the middle of each bin
ax.set_xticks([])
ax.set_yticks(np.arange(values.size))
ax.set_yticklabels(correlationsdf['Variable'].values)

# Add colorbar legend
cbar = fig.colorbar(heatmap)

# Show the plot
plt.show()
The heatmap above clearly highlights that none of the lab-result variables are significantly correlated with whether or not a person has Thrombosis, making almost all of the dataset irrelevant when it comes to further understanding the disease. The only variables that show some correlation are acl_igg and ana (both in the 0.2-0.4 range), and both come from df_B, making the third dataset even less useful.
6 Technical Summary
Surprisingly, the data provided for this analysis is much lower in quality than expected, especially coming from such a prestigious institution. The main issues I found with the data are:
Missing Values: Missing values are a very serious issue with this data, especially in tsumoto_c, where almost half the columns/features have more than 80% missing values. This issue was the main obstacle in trying to further understand Thrombosis and its relationship with the general lab tests. There could be important relationships and patterns that went undetected because of it.
Lack of Thrombosis cases: This research project's main focus was understanding Thrombosis at a deeper level and uncovering patterns in the data that could serve as predictors of the disease. However, the data provided has barely any Thrombosis cases, especially severe ones (levels 2 and 3), making any analysis very sensitive to the addition of a larger number of cases.
Imbalanced Data: Dataset tsumoto_b is highly imbalanced, with almost all patients (around 85%) being female, which could lead to the inaccurate conclusion that Thrombosis occurs much more frequently in females (when sampling only Thrombosis + cases). A proper analysis on gender could not be conducted because of this.
Dates: There is a large gap between the dates patients were examined for Thrombosis and the dates of their lab tests, with some gaps exceeding five years. This weakens any analysis of lab results (tsumoto_c), since patients’ actual values could have been very different had their lab tests been taken closer to the Thrombosis examination.
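The issues listed above can be quantified; for the date gap in particular, a sketch like the following measures how far each patient's closest lab test falls from their examination date. This is a hypothetical illustration: the column names (`id`, `exam_date`, `lab_date`) are assumptions, not the hospital's actual schema.

```python
import pandas as pd

# Toy stand-ins for the examination dates (tsumoto_b) and lab test
# dates (tsumoto_c); one patient has a multi-year gap, one does not.
exams = pd.DataFrame({
    "id": [1, 2],
    "exam_date": pd.to_datetime(["1994-05-01", "1993-03-15"]),
})
labs = pd.DataFrame({
    "id": [1, 1, 2],
    "lab_date": pd.to_datetime(["1989-01-10", "1990-06-01", "1993-02-01"]),
})

# join each lab test to its patient's examination date
merged = labs.merge(exams, on="id")
merged["gap_years"] = (merged["exam_date"] - merged["lab_date"]).dt.days / 365.25

# smallest gap per patient: how close the best available lab test is
closest = merged.groupby("id")["gap_years"].min()
print(closest.round(2))
```

On the real data, patients whose smallest gap is several years are exactly the ones that make the lab-result analysis unreliable.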
However, it is important to note that all the data was collected between 1989 and 1996, when data collection methods were less advanced and discrepancies in data were common across the industry. A better approach would be to conduct an analysis of Thrombosis using more recent data, if available.
In terms of main findings, they are the following (elaborated more in results.html):
Level 3 Thrombosis appears to occur in people on the higher end of the age distribution.
ANA and aCL IgG appear to be the lab results with the most predictive power.
Different levels of Thrombosis are accompanied with different sets of symptoms.
7 Final Results
Figure 1: Distribution of age among thrombosis patients.
Figure 1 illustrates that the most severe cases of Thrombosis (Level 3) tend to occur in people aged 35-55, a slightly older range than that of patients with less severe cases. This might indicate that a patient’s age could influence the Thrombosis level they are diagnosed with, but it is probably not powerful enough to determine whether a person has Thrombosis at all. In other words, if patient A has Thrombosis, age might tell us whether they have Level 1, 2, or 3. However, if patient A comes in to be tested for Thrombosis without prior knowledge of whether they have the disease, their age might not be indicative of anything.
Figure 2: Average lab results per Thrombosis level.
In terms of the special lab tests conducted at the Laboratory of Collagen Diseases, Figure 2 strongly suggests that ANA and aCL IgG are good indicators of whether a person has Thrombosis. The plot shows that the higher a patient’s degree of Thrombosis, the higher that patient’s ANA values are on average. As for aCL IgG, it is much higher for patients with Level 1 Thrombosis than for all other groups, with the value dropping as severity increases. If more tests had been provided in a higher-quality dataset, additional relevant features might have been found.
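The per-level averages behind Figure 2 reduce to a single groupby. This is a hedged sketch on toy data; the column names (`thrombosis`, `ana`, `acl_igg`) are assumptions about the merged dataset, and the values below are invented to mirror the pattern the figure shows.

```python
import pandas as pd

# Toy stand-in for the merged patient table: ANA rises with severity,
# while aCL IgG peaks at Level 1 and drops afterwards.
df_b = pd.DataFrame({
    "thrombosis": [0, 0, 1, 1, 2, 3],
    "ana":        [16, 64, 256, 256, 1024, 4096],
    "acl_igg":    [1.2, 0.8, 12.5, 9.0, 4.1, 2.3],
})

# average lab value per Thrombosis level, as plotted in Figure 2
means = df_b.groupby("thrombosis")[["ana", "acl_igg"]].mean()
print(means)
```

A bar plot of `means` (one group of bars per level) reproduces the shape of Figure 2.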
Figure 3: Most prevalent symptoms per Thrombosis level.
Figure 3 shows that different levels of Thrombosis are accompanied by different sets of symptoms that are prevalent for each group. Specifically:
Brain infarction is the most dominant symptom for Thrombosis Level 1 patients.
CNS lupus is the most dominant symptom for Thrombosis Level 2 patients.
Thrombocytopenia is the most dominant symptom for Thrombosis Level 3 patients.
The symptoms a patient presents with at a clinic could provide an initial guess as to which Thrombosis level they might have; however, more (and higher-quality) data is needed to validate this claim.
Figure 4: Correlation of lab test results with Thrombosis.
Finally, Figure 4 indicates that only two of the laboratory tests for which data was provided are possible predictors of Thrombosis: ANA and aCL IgG. This figure uses a different mathematical technique to validate what was concluded from Figure 2, further supporting the idea that there could be lab tests already being carried out that have the potential to predict a patient’s risk of Thrombosis.
8 Conclusion
Overall, this research project has highlighted two crucial points. Firstly, thrombosis can potentially be predicted using routine lab tests carried out at hospitals on a daily basis. If this claim is further examined and proven true, it could save thousands of lives from this severe disease. Secondly, despite the limited and low-quality data provided, this study has improved our understanding of thrombosis, and there is clear potential for even more discoveries.
Although my work does not provide a definitive solution for predicting or preventing thrombosis, it serves as a stepping stone towards securing proper funding, obtaining more and better quality data, and building a larger team of scientists to further explore the promising findings of this research.